Natural Language Engineering
Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text
Authors: John S. Justeson and Slava M. Katz
† Current address: Weston Language Research, 138 Weston Road, Weston, CT 06883, USA.
Abstract
This paper identifies some linguistic properties of technical terminology, and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially those multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase. The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
This paper outlines some linguistic properties of technical terms that lead to the formulation of a robust, domain-independent algorithm for identifying them automatically in continuous texts. In particular, it addresses multi-word noun phrase terms. Judging from data in dictionaries of technical vocabulary, the majority of technical terms do consist of more than one word; among these, the overwhelming majority are noun phrases, which constitute the vast majority of multi-word terminological units in probably all domains.
The algorithm we present is quite simple conceptually, yet it performs very well: it recovers a high proportion of the valid terms in a text, and the proportion of nonterminological word sequences recovered is low. It has been tested on a variety of text types and domains.
The list of candidate terms produced by the algorithm is useful for a variety of tasks in natural language processing, such as text indexing and construction of glossaries for translation.
While 'technical terminology' is the fundamental notion of this paper, this notion has no satisfactory formal definition. It can be intuitively characterized: it generally occurs only in specialized types of discourse, is often specific to subsets of domains, and when it occurs in general types of discourse or in a variety of domains it often has broader or more diverse meanings. In this paper, we treat it as an undefined term for a basic, intuitively recognizable construct.
The first part of the paper discusses some of the properties of technical terms, which provide the linguistic underpinnings of the algorithm: the patterns of use of terminological units in text (section 1), and the grammatical structures of these units (section 2). Section 3 describes an efficient implementation of these ideas. The performance of the algorithm is illustrated in section 4, in a detailed analysis of terms recovered from three recent papers in different domains. Section 5 relates this research to other recent work.
1 Repetition of technical terms
Terminological noun phrases (NPs) differ from other NPs because they are LEXICAL: they are distinctive entities requiring inclusion in the lexicon because their meanings are not unambiguously derivable from the meanings of the words that compose them. An example is central processing unit, whose referent is much more specific than the words themselves might suggest. Lexical NPs are subject to a much more restricted range and extent of modifier variation, on repeated references to the entities they designate, than are nonlexical NPs. This applies to variation in the omission of modifiers, in the insertion of modifiers, and in selection among alternative modifiers. This section outlines the differences and their sources.
After an entity is introduced into a discourse via a nonlexical NP, it can be referenced simply by a noun or NP head of that phrase: such omission of modifying words and phrases is semantically neutral, if the meaning of a phrase is compositionally derivable from that of its head and those of its modifiers. In addition, an entity introduced by a nonlexical NP can be and often is reintroduced via a variety of other NPs. In fact, several factors promote variation and inhibit exact repetition of these NPs on repeated references. When an entity is introduced with one set of modifiers in a nonlexical NP, these modifiers typically function as means for specifying the entity or type of entity referred to, an aspect of the entity that is in focus, or an orientation to the entity. When this is the function of the NP's modifiers, inclusion of the same modifiers on a subsequent reference to the same entity is usually pragmatically anomalous. Accordingly, the typical follow-up reference to the entity is by a definite NP that is either a head of the original NP, or an approximate synonym for either the NP or its head.
Footnote 1: We limit the class of MODIFIERS (as e.g. in Huddleston 1984:233-5) by excluding the general class of DETERMINERS, premodifiers that are applicable to virtually any NP, regardless of its meaning; unless otherwise stated, 'NP' is used in this paper to refer to the core of an NP, excluding its determiners. This is because determiners tend to inform discourse pragmatics rather than lexical semantics, or to serve as quantifiers (see section 2); these functions are generally applicable to all NPs, so the tendency of determiners to be repeated or not is independent of the lexical vs. nonlexical status of the NP they modify.
Repetition including the modifiers of a nonlexical NP can be appropriate pragmatically, when repetition of the specifying function is motivated; this can occur when the specified attribute is being emphasized, or when the referent of the NP is being distinguished from that of another NP with the same head. The more modifiers are involved, the less likely such possibilities are. Even when repetition of the full NP might be pragmatically appropriate, precise repetition can create a tedious or monotonous effect, the more so the longer the NP and the more recently the repeating phrase was used; some sort of stylistic variation is usual. Exact repetition of nonlexical NPs is expected to occur primarily either when they are widely separated in relatively large texts or else as an accidental effect.
In contrast, omission of modifiers from a lexical NP normally involves reference to a different entity. Lexical NPs, even those with compositional semantics, are much less susceptible to the omission of modifiers. When a lexical NP has been used to refer to an entity, and that entity is subsequently reintroduced after an intervening shift of topic, the reintroduction of reference to it is very likely to involve the use of the full lexical NP, especially when the lexical NP is terminological.
Lexical NPs are also far less susceptible than nonlexical NPs to other types of variation in the use of modifiers. Modifying words and phrases can be inserted within a nonlexical NP but not, without a change of referent, within a lexical NP. Similarly, the precise words comprising a nonlexical NP can be varied without a change of referent, but usually not in a lexical NP. Variations either in the choice of some words or in the presence vs. absence of some words in terminological NPs reflect distinct terms, often differentia of a noun or NP head.
In technical text, which is the sole concern of the remainder of this paper, lexical NPs are almost exclusively terminological.
Accordingly, the above considerations suggest that variation in the form of an NP in repeated references to the entity it designates is a major textual difference in the uses of terminological vs. nonterminological NPs that can be exploited in building a terminology identification algorithm.
Footnote 2: Consider, for example, the terminological unit word sense, as used in a paper analyzed in section 4 (Pustejovsky and Boguraev 1993). This is by far the most frequent technical term extracted from the paper. The construct occurs 49 times, in 42 sentences. In 33 instances it occurs in the form word sense; in 16 it is used simply as sense. The contexts of the reduced form are quite limited. Usually, it occurs when a nearby sentence, or even successive sentences, containing a reference to this construct use the form word sense(s); when the expression occurs more than once in a sentence it is more likely than not that the reduced form will be used, and very often this is along with the full form. It also occurs in the reduced form when it appears in other technical terms (sense selection), and when senses of a particular word are being discussed.
This difference applies primarily to multi-word terms. All 1-word NPs (nouns) are by definition lexical. The differences between lexical and nonlexical NPs discussed above involve variations in modifier usage. The primary variation involving 1-word NPs (nouns) is noun substitution (e.g. via synonyms, hypernyms, and hyponyms), to which both terminological and nonterminological nouns are subject, and the tendency of nonterminological NPs to avoid exact repetition is least pronounced in the shortest NPs. Accordingly, 1-word terminological NPs are less resistant to, and 1-word nonterminological NPs less prone to, variability in expression than are multi-word NPs of the corresponding types. The repetition of 1-word NPs, i.e. of nouns, therefore does not provide as powerful a contrast between terminological and nonterminological NPs as does the repetition of multi-word NPs. It is primarily for this reason that multi-word NPs are the focus of our terminology identification algorithm, presented in section 3. As it happens, multi-word NPs constitute the majority of all terminological units in technical vocabularies, so this focus helps us to capture the majority of technical terms.
There is also a restricted difference in the susceptibility to repetition of the entities referred to by terminological vs. nonterminological NPs. This difference is specific to the use of novel terminology, i.e. terms that are newly introduced and not yet widely established, or terms that are current only in more advanced or specialized literature than that with which the intended audience can be presumed to be familiar. Whether or not a novel technical term is used for the construct to which it refers, a discursive statement of the construct must be made at or near the first reference to it (or, with suitable indication, in a glossary) in cooperative discourse. If the context of this explanatory statement is the only one in which the construct is referenced, then the use of the term itself does little to advance the exposition. We expect that the use of novel terminology is most often justified by the convenience of its use in further instances, and probably in fact in more than one paragraph.
Established terminological NPs, such as semantic load or binary tree, may but need not be repeated in a text. But when an entity designated by such an NP is a topic of significant discussion within a text, that entity is almost certainly referred to repeatedly; as previously discussed, terminological NPs tend to be repeated intact on repeated references to the entities they designate. Accordingly, established, topically significant terminological NPs do tend to be repeated in a text.
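The repetition criterion can be illustrated with a minimal sketch. It is not the implementation described in section 3. It assumes tokens that already carry hypothetical coarse part-of-speech tags ("A" adjective, "N" noun, "O" anything else), and it combines the observation that terminological NPs tend to be repeated intact with the preference, noted in the abstract, for phrases built from adjectives and nouns. The function name `candidate_terms`, the tag labels, and the thresholds `min_count` and `max_len` are illustrative assumptions; prepositions such as "of" are deliberately left out to keep the example short.

```python
from collections import Counter

# Hypothetical coarse tags used only for this illustration.
NOUN, ADJ = "N", "A"


def candidate_terms(tagged_tokens, min_count=2, max_len=4):
    """Return multi-word adjective/noun sequences that end in a noun and
    occur at least `min_count` times, mapped to their frequencies."""
    counts = Counter()
    n = len(tagged_tokens)
    for start in range(n):
        # Try every window of 2..max_len tokens beginning at `start`.
        for end in range(start + 2, min(start + max_len, n) + 1):
            window = tagged_tokens[start:end]
            tags = [tag for _, tag in window]
            if any(tag not in (NOUN, ADJ) for tag in tags):
                break  # any longer window from this start contains it too
            if tags[-1] == NOUN:
                counts[" ".join(word.lower() for word, _ in window)] += 1
    # Keep only sequences repeated word for word.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


sample = [
    ("The", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
    ("executes", "O"), ("instructions", "N"), (".", "O"),
    ("A", "O"), ("central", "A"), ("processing", "N"), ("unit", "N"),
    ("has", "O"), ("fast", "A"), ("registers", "N"), (".", "O"),
]
print(candidate_terms(sample))
# {'central processing': 2, 'central processing unit': 2, 'processing unit': 2}
```

The unrepeated phrase "fast registers" is dropped, while the repeated "central processing unit" survives. The sketch also returns nested substrings such as "processing unit"; how to treat those is a design decision it leaves open.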
Nontopical terminological NPs may or may not be repeated; nonrepeated terminological NPs are mostly nontopical.
Some nonterminological NPs behave much like terminological NPs. Such NPs are likely to be repeated, word for word, only as a way of aiding recognition that the reference is to the same construct that was designated earlier. Arguably, however, such uses are effectively coinages of intentionally temporary terms for nonstandard constructs.
2 Structure of technical terms
The previous section describes a pattern of constraints on the uses of terminological NPs. It is generally recognized that terminological NPs differ also in structure, at least statistically, from nonlexical NPs. This recognition is embodied in the observation that technical jargon makes heavy use of noun compounds. Based both on general considerations and on empirical study of terminology in technical vocabularies, we propose a specific set of structural constraints on terminological NPs that hold in so high a proportion of cases as to be useful for automatic terminology identification.
The structures of technical terms can be illustrated by sampling from available sources for different domains. We selected dictionaries of technical terminology in fiber optics (Weik 1989), medicine (Blakiston's Gould 1984), physics and mathematics (Lapedes 1978), and psychology (English and English 1958). From each dictionary, we extracted random samples of 200 technical terms. Noun phrases constitute 185 of the 200 medical terms, 185 of the 200 psychological terms, 197 of the mathematical terms, and 198 of the fiber optics terms, i.e. from 92.5% to 99.0% of the terms in each domain. Of the 35 non-NPs among these 800 terms, 32 are adjectives and 3 are verbs. We then extracted additional terms at random until 200 noun phrase terms had been extracted from each dictionary. Out of these 800 NP terms, 564 have more than one word and thus might have words other than nouns.
Not one of these 564 terms has either a determiner or an adverb; only 2 have a conjunction (and); and just 17 have a preposition (in 15, this preposition is of). Thus, 97% of multi-word terminological NPs in these sources consist of nouns and adjectives only, and more than 99% consist only of nouns, adjectives, and the preposition of.
This prevalence of noun phrases containing only nouns and adjectives follows from generalizations concerning the typical structures of technical semantic domains. Such domains are organized largely as taxonomies. Ethnolinguistic investigations have established that the terms for taxonomic categories are quite regular in structure (Berlin, Breedlove, and Raven 1973). Those at a level that can be considered a 'basic' or 'generic' level for discourse in the field tend to consist of a single word, or of a single word and a modifier. Furthermore, single words in general vocabulary are rarely appropriate for technical usage in a more specialized meaning because they are thereby inherently ambiguous; when native English forms are used to create new terms, it most often takes at least two words to adequately specify a meaning, and when this is done they usually have just one meaning and are relatively transparent semantically. Often, well established one-word terms are Greek or Latin forms made up of more than one root, e.g. aerodynamics; these would often be multi-word terms had they been based on English forms (air flow).
Daughter nodes of a taxonomy are normally labelled by a term of the same complexity, or by one including one additional modifier; the typical form is the label for the mother node plus a modifier. Furthermore, modifiers applied to the label for one taxon in designating a more specific level are also often applied to other taxons in designating their differentia, leading from hierarchical toward paradigmatic (cross-classificational) structure.
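The percentages quoted for the 564 multi-word terms can be checked with a short calculation. The sketch below uses only the counts given in the text. It assumes that no term contains both a conjunction and a preposition, and that the 15 terms with of otherwise contain only nouns and adjectives; the source does not state either point. The variable names are illustrative.

```python
multi_word = 564        # multi-word NP terms among the 800 sampled NP terms
with_conjunction = 2    # terms containing "and"
with_preposition = 17   # terms containing any preposition
with_of = 15            # of those 17, terms whose preposition is "of"

# Assumption: the conjunction terms and preposition terms do not overlap.
nouns_adjs_only = multi_word - with_conjunction - with_preposition

# Assumption: the "of" terms contain nothing else besides nouns and adjectives.
nouns_adjs_of = multi_word - with_conjunction - (with_preposition - with_of)

share_plain = nouns_adjs_only / multi_word
share_with_of = nouns_adjs_of / multi_word

print(nouns_adjs_only, f"{share_plain:.1%}")    # 545 96.6%
print(nouns_adjs_of, f"{share_with_of:.1%}")    # 560 99.3%
```

Under these assumptions the counts give 96.6% and 99.3%, which agree with the reported 97% and "more than 99%".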
As a result of these trends, 2-word terms are the modal length in systems that have been subject to thorough investigation. We find the same in our dictionary samples. Overall, the average length of NP terms in these samples is 1.91 words; individual dictionaries provide values ranging from a low of 1.78 for medical terms to a high of 2.08 for fiber optics terms. In the typical distribution of term length, the number of 2-word terms is substantially larger than the number of 1-word terms, with the
Table 1. Frequencies of NP terms of different lengths in samples from four domains. (Only 3 out of 800 terms have more than 4 words; none has more than 6 words.)
Term length (in number of words)